-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_csv adding additional columns as integers instead of strings #47137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1304,8 +1304,10 @@ cdef class TextReader: | |||
if self.header is not None: | |||
j = i - self.leading_cols | |||
# generate extra (bogus) headers if there are more columns than headers | |||
# These should be strings, not integers, because otherwise we might get | |||
# issues with callables as usecols GH#46997 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if these are truely bogus, why are they even passed to the usecols callable? There are no circumstances where these "auto-generated" column names are/should be in the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #47138, this is the bogus case
You can access them when using an index-based usecols setting, but the column name is shifted afterwards.
Edit: If we want to change this (I think I would be in favor of that), we have to deprecate first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the column name is shifted afterwards.
right, so if they don't currently make their way into the result, we are not at risk of changing existing behavior with this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No.
At least I can not see a way in how this would impact anything.
You can not select these columns by name via usecols. If you select them by position, an available name is used, not the bogus name.
The only impact this should have is consistensy when feeding them into the usecols callable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least I can not see a way in how this would impact anything.
cool.
doc/source/whatsnew/v1.5.0.rst
Outdated
@@ -809,6 +809,7 @@ I/O | |||
- Bug in :func:`read_parquet` when ``engine="pyarrow"`` which caused partial write to disk when column of unsupported datatype was passed (:issue:`44914`) | |||
- Bug in :func:`DataFrame.to_excel` and :class:`ExcelWriter` would raise when writing an empty DataFrame to a ``.ods`` file (:issue:`45793`) | |||
- Bug in :func:`read_html` where elements surrounding ``<br>`` were joined without a space between them (:issue:`29528`) | |||
- Bug in :func:`read_csv` adding columns as integers instead of string when data is longer than header leading to issue with ``usecols`` (:issue:`46997`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you reword so others are not confused like I was. "adding columns as integers" is misleading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but this is tricky. Technically this is an implementation detail that should not leak into the outside world...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm ex @simonjayhawkins comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm merge on green
pandas-dev#47137) * BUG: read_csv adding additional columns as integers instead of strings * Reword whatsnew
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.All columns are interpreted as strings, even if they are numeric. So, we have to be consistent with the bogus columns